Intelligent personalized shopping recommendation using clustering and supervised machine learning algorithms

Next basket recommendation is a critical task in market basket data analysis. It is particularly important in grocery shopping, where grocery lists are an essential part of shopping habits of many customers. In this work, we first present a new grocery Recommender System available on the MyGroceryTour platform. Our online system uses different traditional machine learning (ML) and deep learning (DL) algorithms, and provides recommendations to users in a real-time manner. It aims to help Canadian customers create their personalized intelligent weekly grocery lists based on their individual purchase histories, weekly specials offered in local stores, and product cost and availability information. We perform clustering analysis to partition given customer profiles into four non-overlapping clusters according to their grocery shopping habits. Then, we conduct computational experiments to compare several traditional ML algorithms and our new DL algorithm based on the use of a gated recurrent unit (GRU)-based recurrent neural network (RNN) architecture. Our DL algorithm can be viewed as an extension of DREAM (Dynamic REcurrent bAsket Model) adapted to multi-class (i.e. multi-store) classification, since a given user can purchase recommended products in different grocery stores in which these products are available. Among traditional ML algorithms, the highest average F-score of 0.516 for the considered data set of 831 customers was obtained using Random Forest, whereas our proposed DL algorithm yielded the average F-score of 0.559 for this data set. The main advantage of the presented Recommender System is that our intelligent recommendation is personalized, since a separate traditional ML or DL model is built for each customer considered. Such a personalized approach allows us to outperform the prediction results provided by general state-of-the-art DL models.


Introduction
Grocery shopping is a common activity that involves several important factors such as time, budget, and purchasing pressure [1]. In this context, well-conceived grocery lists can be an efficient planning and budgeting tool. Several studies have indicated that a majority of modern customers rely on a written, mental, or digital grocery list [2,3] to assist them in their shopping. Furthermore, the same studies have also revealed that consumers generally had a growing interest in applications that helped them interactively manage their grocery lists, while informing them about product prices and special offers.
Typically, grocery retailers propose new specials every week to attract new customers and improve sales and profits. For example, Walters and Jamil [4] have shown that in a regular grocery shopping trip involving cross-category products, 39% of the items in a customer's basket were special offers. These authors also concluded that about 30% of surveyed customers were highly influenced by different coupons and specials.
While special prices sometimes allow customers to make significant savings, thousands of them are usually released every week, often leading to huge information overload. This makes the task of selecting the most advantageous offers for a given customer an extremely challenging one [5].
With the development of online shopping, recent advancements in machine learning techniques, and favorable reactions of many customers to user-friendly applications aiming at improving their shopping experience, the development of an online recommender grocery shopping system able to provide valuable individual recommendations seems to be a very relevant task. MyGroceryTour (http://mygrocerytour.ca) is a good example of such a recommender system. MyGroceryTour is a Canadian shopping database and website that allows users to manage their grocery lists based on available weekly promotions in most major grocery stores located in their area [6].
One of the main purposes of our study is to present a new ML-based recommender system for grocery shopping based on the MyGroceryTour users' purchase histories, profiles, preferences and available weekly specials in order to assist them in creating cost-effective personalized weekly grocery lists (see Fig 1).
Our main contributions are the following: 1. We present our novel personalized Recommender System available on the MyGroceryTour platform; 2. We perform a clustering analysis to partition Canadian customers into non-overlapping clusters according to their grocery shopping habits; including the application of clustering, traditional machine learning, and deep learning algorithms. Our main results are then described and discussed in the Results and Discussion section, which is followed by our main conclusions.

Related work
Recommender systems (RS) [7] have been an increasingly important field of study since the first research papers on Collaborative Filtering in the mid-90s [8][9][10], and the expansion of e-commerce and online shopping [11]. Recommender systems include algorithms and software aiming at providing users with personalized item recommendations to help them overcome the data overload issue and to assist them in decision-making processes. The recommended items represent the output of the recommender system while their nature may vary depending on the context; among others, the items can include movies, songs, retail products, or online documents [7,12]. Nowadays, several strategies to build recommender systems have been described in the literature. Here, we present the most popular of them, and those related to our case study.
Collaborative filtering (CF) is one of the most popular and efficient RS techniques [14,15]. It is based on the word-of-mouth concept and assumes that a user trusts another user with similar reasoning and taste. It also makes the assumptions that two similar users have similar interests, and that two similar items have similar ratings [16]. The most common limitations faced by CF methods are the cold start and sparse matrix issues [17]. The cold start issue is characterized by the lack of initial information regarding a newly introduced user or item, whereas the sparse matrix issue typically occurs when a given user tends to interact with only a few items out of the massive amount of available products [12,18,19].
Content-based (CB) filtering, on the other hand, tends to recommend items whose features and characteristics are similar to those of other items in which a given user showed positive interest in the past [20]. This approach requires the use of metadata relative to each considered item, which can sometimes represent a challenge.
In an attempt to overcome the limitations of the collaborative filtering and content-based filtering techniques, hybrid approaches, trying to combine both of them, have been introduced. The works of Adomavicius et al. [21] as well as, more recently, Lu et al. [12] reviewed different methods used in the field of RS, highlighting their pros and cons and giving insight into the future developments in the field.
Knowledge-based recommender systems (KBRS) [25,26] can be efficiently used for recommending highly customized products (e.g., real estate or automobiles). Unlike classical methods such as collaborative filtering (CF) or content-based filtering (CB), a KBRS seeks to obtain explicit user requirements by direct solicitation, allowing the user to have more control over the recommendation while building interactive feedback.
Context-aware recommender systems [24,27] rely on multiple sources of information to identify a certain context and to generate more accurate recommendations (e.g., recommending swimsuits instead of winter coats in summer).
Finally, demographic-based recommender systems [28] group users based on their available demographic attributes (e.g., age, gender, location), assuming that people within the same group (neighborhood) rate items similarly. This approach was originally introduced to improve the quality of recommendations, but it has also proved to be useful for solving the cold start problem [29].
Let us now recall some recent works addressing the issue of next grocery basket recommendation. Yu et al. [30] introduced an efficient model, called Dynamic REcurrent bAsket Model (DREAM), based on recurrent neural networks. One of the main advantages of DREAM is that it is not only able to learn a dynamic representation of a user but also takes into account global sequential features among baskets. However, the original DREAM model of Yu et al. was designed to perform binary classification only. For each available product, the model generates a probability score accounting for the probability that this product will be included in the next basket purchased by a given customer. Nevertheless, DREAM cannot provide predictions in a multi-store (i.e. multi-class) context, consisting in predicting the store where the recommended product should be bought. Moreover, in their work, Yu et al. did not consider some important features such as product prices, product availability, and weekly specials offered in local stores. This motivated us to generalize the original DREAM model to a multiclass classification task to predict both whether a given product should be included in the customer's next basket and in which store the purchase should be made (for more details, see the Materials and methods section).
Che et al. [31] described a new prediction method using attention-based recurrent neural networks to detect and model both inter- and intra-basket relationships. The authors proposed to consider all of a user's available baskets to model the user's long-term preferences, whereas the intra-basket attention model was intended to act on the item level in the user's most recent baskets to predict the user's behavior and current short-term preferences. Through their adaptive attention mechanism, Che et al. were able to outperform state-of-the-art methods for next basket recommendation, although their method applies only in a binary classification context.
Faggioli et al. [32] used the recency factor to predict the consumer's next grocery basket applying a CF-based prediction method under a general top-n recommendation framework. To show the efficacy of their method, the authors compared it with some state-of-the-art CF models.
Content-based recommendations were also shown to be effective in the field of next basket and grocery coupon recommendation. In this context, Xia et al. [33] proposed a tree-based CB model for coupon recommendations. These authors streamlined the coupon selection process in order to personalize the recommendation and increase the clickthrough rate. Using the random forest and XGBoost classifiers, Xia et al. were able to improve the estimated coupon click rate from 1.20% to 7.80%.
Moreover, Prokhorenkova et al. [34] described and tested a new statistical method based on the Yandex CatBoost model to predict whether a given customer is likely to purchase some selected products. Dou [35] considered real unbalanced shopping data from an e-commerce platform and used the CatBoost model to predict whether or not customers will buy some available products. The method proposed by Dou achieved a prediction accuracy of 88.51%.
Lee et al. [36] proposed to use recurrent neural networks instead of collaborative filtering techniques to create a multi-period product recommender system related to an online food market. The system introduced by Lee et al. is able to recommend products by multiple periods in a time sequence. The authors showed that the proposed recommender system provided a higher performance in accuracy and diversity in a multi-period perspective than CF-based systems. Moreover, the proposed system also showed a robust behavior in terms of consumers' purchasing orders and repetitive purchase patterns.
Zheng and Ding [37] proposed a personalized recommendation system based on an Immersive Graph Neural Network (IGNN), which is intended to increase the marketing quantity of various commodities, to improve users' shopping experience, promote sales, and thus motivate the market development. The authors considered an immersive marketing environment using deep learning and graph neural network models. However, as suggested by the authors, the proposed recommendation system was not verified in practical applications. Thus, the impact of the presented model on real users was not assessed.
Finally, Tahiri et al. [6] have recently proposed to use both recurrent and feedforward neural networks that were combined with non-negative matrix factorization and gradient boosting trees in order to build intelligent grocery baskets for the users of the MyGroceryTour platform. Tahiri et al. considered different features and far fewer real customers (compared to our study) to describe the behavior of the MyGroceryTour users. Their best F-score result of 0.37 was obtained when their general prediction model was applied to an augmented dataset. However, in their work, Tahiri et al. did not perform any clustering analysis and did not consider different categories of customers. As we will see in the next sections, this kind of analysis is very important for improving the prediction performance. Moreover, Tahiri et al. did not compare the results generated by their DL model with those provided by traditional ML algorithms. Such a comparison is crucial when the data set at hand is rather small. Finally, the DL model introduced by these authors is not personalized, as the same model architecture was used for all customers considered.
In their paper, Gupta and Shrinath [38] presented a Collaborative Filtering-based model tailored to overcome the cold start problem. To achieve that, the authors proposed to compute the weighted sum of four different features. The first of them is the item rating obtained using Weighted Non-negative Matrix Factorization, followed by the Affinity Propagation technique. The three other ratings are graph-related similarity measures based on the users' metadata as well as on their purchasing habits. Gupta and Shrinath reported that their model outperformed the existing approaches based on Hit Ratio and Normalized Discounted Cumulative Gain.
Li et al. [39] suggested several novel metrics to measure the repetition/exploration ratio and performance of next basket recommender systems. They compared and analyzed the results of state-of-the-art next basket recommendation models on three real-world datasets. Their study was conducted with a focus on their new metrics in order to help illustrate the scope of the current state of research and explain the progress provided by the existing approaches as well as the reasons behind the achievements claimed by the studied methods. Li et al. indicated that future research on next basket recommendation should consider an analysis of repetition and exploration behavior to gain useful insights and help to design unbiased models.
Le et al. [40] proposed a framework to model user's basket sequences. Their hierarchical network model, called Beacon and based on an LSTM architecture, consists of three main components, taking as input a basket sequence and a correlation matrix. The basket encoder component produces correlation-sensitive basket representations after capturing intra-basket item correlations. The sequence of basket representations is then used as input for a sequence encoder to extract inter-basket sequential associations. The output from this component is associated with the correlation matrix, and both are used by the predictor component to produce the correlation-sensitive next basket. Therefore, Le et al. took into account the correlative dependencies between items to enhance the representation of individual baskets as well as the overall basket sequence.

MyGroceryTour website
MyGroceryTour is a Canadian grocery information platform available in English and French. The main purpose of MyGroceryTour is to provide users with up-to-date information on the best grocery deals offered by major grocery retailers in their area, allowing them to compare the available products and to create personalized weekly grocery lists based on the provided insights.
The main features of the MyGroceryTour platform are as follows. It allows users to: 1. Search and compare grocery deals in the user's favorite local grocery stores; 2. Create, save, manage and print weekly grocery shopping lists (see Fig 2); 3. Display a map of local grocery stores and pharmacies available for a given postal code or address; 4. Compare the price of a selected product in local stores over a 3-month period (see Fig 3); 5. Find popular Canadian coupons; 6. Display the optimal shopping path based on the user's shopping list (see Fig 4); 7. Receive email alerts when the user's favorite products go on sale; 8. Create personalized intelligent grocery lists following a recommendation by machine learning algorithms (see Fig 5). This recommendation is based on the user's purchase history, the availability of the user's favorite products and the weekly specials offered in local grocery stores and pharmacies.
A MyGroceryTour feature allows customers to compare prices of a selected product at different stores, making it much easier to identify real specials and opportunities. Customers can change their search area depending on their geographical position and their needs while displaying the available grocery products (a distance of 1 to 20 kilometers from the user's home can be specified). Moreover, the users of MyGroceryTour can easily create, manage and save their grocery lists, and then access them at any time.
While users cannot purchase items from retailers directly through the website, they can add products from different stores to their baskets. Once a weekly grocery list is organized, the system will recommend to the user the optimal shortest path starting at the user's home, passing by all selected grocery retailers or pharmacies, and ending at the user's home as well. An efficient algorithm for solving the Generalized Travelling Salesman Problem (GTSP) by Tasgetiren et al. [41] has been implemented by our team, taking into account the real-time local traffic information provided by the Google Maps API and the geographical position of the closest stores belonging to selected retailers (several stores for a selected retailer can be available in a given area). The uniqueness of MyGroceryTour is due to the use of the intelligent recommender system allowing the registered users to get personalized weekly grocery recommendations based on the use of the Random Forest and extended RNN-GRU-based DREAM algorithms, which yielded the best prediction results in our experiments (see the Results and discussion section). A test account with the following credentials (login: test@test.com; password: 123456) has been set up. It can be used to test our ML-based recommender system integrated into the MyGroceryTour platform.

Data description
In this section, we present the dataset which was used in our study. We considered 831 users of the MyGroceryTour web platform with varying amounts of saved weekly grocery lists (varying between 3 and 99). All real data considered here were anonymized. The data are available at: https://drive.google.com/file/d/1q-LkWMx5ar-OGlPPLFwSDi-IbLe7ZaIo/view?usp=sharing. The data collection and analysis method complied with the terms and conditions for the source of the data. Grocery lists used in our experiments included grocery products the users planned to buy during a given week (the time period from January 2017 to June 2021 was covered). The following features (i.e., explanatory variables) from the original dataset have been considered in our experiments:
• user_id (numerical): unique user identifier;
• list_id (numerical): unique shopping list identifier;
• product_id (numerical): unique product identifier;
• category (categorical): category of the product;
• price (numerical): the price of the product;
• special (numerical): discount on the product (in %) compared to its regular price;
• distance_avg (numerical): average distance between the user's home and all stores where the product was available;
• availability (binary): availability of the product at different stores.
We completed this list of features by an additional total_bought feature that represents the total number of times a given product has been bought by all users.
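As an illustration only, the total_bought feature could be derived with pandas as sketched below. The toy table, and the assumption that each row of the data corresponds to one purchase of a product on one list, are ours and are not taken from the released dataset.

```python
import pandas as pd

# Toy stand-in for the real data: one row per product saved on a grocery list.
lists = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "list_id":    [10, 10, 20, 21, 30],
    "product_id": [100, 101, 100, 102, 100],
})

# total_bought: number of rows (purchases) of each product over all users.
lists["total_bought"] = (
    lists.groupby("product_id")["product_id"].transform("count")
)
```

Using transform keeps the result aligned with the original rows, so the new column can be used directly next to the other explanatory variables.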

Data normalization
Data normalization is a common practice and an important step in both unsupervised and supervised machine learning [42], and data mining [43]. Data normalization has been shown theoretically and empirically to be an essential step to obtain better predictions from a model [44][45][46]. Normalization allows one to bring all features to the same scale, making them mutually comparable, thus ensuring a more stable learning process and providing better results for both clustering and supervised learning methods, and specifically for gradient-based algorithms. Prior to feeding the data to our models, we also encoded our categorical feature (i.e., the product's category), converting it into a numerical vector. We used the FeatureHasher class from scikit-learn [47,48] to encode the category feature. This class takes strings as input and converts them into numerical vectors using a hash function.
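A minimal sketch of hashing a category string into a fixed-length numerical vector with scikit-learn's FeatureHasher is given below. The category names and the vector length n_features=8 are illustrative assumptions; the values used in the actual experiments are not stated in the text.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features=8 is an arbitrary choice for this sketch.
hasher = FeatureHasher(n_features=8, input_type="string")

# Hypothetical category labels; each sample is a list holding one string.
categories = ["dairy", "bakery", "dairy"]
X = hasher.transform([[c] for c in categories]).toarray()

# Identical categories always map to the same vector; distinct categories
# may collide when n_features is small.
print(X.shape)
```

Because hashing is stateless, no vocabulary has to be fitted, which is convenient when a separate model is trained for every user.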
In our study, we used two popular data normalization techniques: z-score and MinMax rescaling [49]. Z-score normalization is a rescaling of data so that the normalized data have a mean of 0 and a standard deviation of 1 (Eq 1):
$$z(x_f) = \frac{x_f - \mu_f}{\sigma_f}, \tag{1}$$
where $z(x_f)$ is the normalized value, and $x_f$ is the observed original value of feature $f$ at a given observation, $\mu_f$ is the mean of $f$, and $\sigma_f$ is the standard deviation of $f$. The MinMax normalization is carried out using the following formula (Eq 2):
$$x'_f = \frac{x_f - \min(x_f)}{\max(x_f) - \min(x_f)}, \tag{2}$$
where $x'_f$ is the normalized value and $x_f$ is the observed original value of feature $f$ at a given observation, $\min(x_f)$ is the minimum value of feature $f$ over all observations, and $\max(x_f)$ is the maximum value of feature $f$ over all observations.
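The two rescalings can be checked numerically. The sketch below applies them to a made-up price column with NumPy and compares the result with scikit-learn's StandardScaler and MinMaxScaler; the price values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values of one feature (e.g., price) over four observations.
prices = np.array([[1.99], [2.49], [3.99], [5.49]])

# z-score: subtract the feature mean, divide by the feature standard deviation.
z = (prices - prices.mean(axis=0)) / prices.std(axis=0)

# MinMax: map the feature onto the interval [0, 1].
span = prices.max(axis=0) - prices.min(axis=0)
mm = (prices - prices.min(axis=0)) / span

# The scikit-learn scalers implement the same formulas.
z_sk = StandardScaler().fit_transform(prices)
mm_sk = MinMaxScaler().fit_transform(prices)
```

In a supervised setting, the scaler statistics would normally be estimated on the training part of the data only and then reused on the test part.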

Clustering methods
Clustering is part of data analysis aiming at finding homogeneous groups of objects in data. Clustering algorithms are divided according to input data formats and output cluster structure formats. A generic data format is the so-called object-to-feature matrix $X = (x_{if})$, in which the rows $x_i$ ($i = 1, 2, \ldots, N$) correspond to given objects (customers in our case) and columns $f$ ($f = 1, \ldots, F$) correspond to features characterizing those objects (e.g., product's price, product's rebate (if on special), product's category in our case). A generic cluster structure format is a partition of the set of objects in non-overlapping clusters $S_1, S_2, \ldots, S_K$. The number of clusters $K$ must be 2 or more, but not too many, so that usually $K \ll N$ and clusters are aggregate representations of the data matrix $X$.
Two data clustering methods, K-means [50] and Ward's [51] algorithms, have been applied in our study.
The cluster structure in K-means [50,52] is specified by a partition $S$ of the set of objects into $K$ non-overlapping clusters, $S = \{S_1, S_2, \ldots, S_K\}$. Each partition $S$ is characterized by the list of objects belonging to each of its clusters $S_k$ ($k = 1, \ldots, K$) and the cluster centroids $c = (c_1, c_2, \ldots, c_K)$. The problem is to find a partition $S = \{S_1, S_2, \ldots, S_K\}$ and cluster centroids $c = (c_1, c_2, \ldots, c_K)$ that minimize the sum of squares criterion. The K-means algorithm follows the so-called alternating minimization scheme for finding a $K$-cluster partition that minimizes Criterion (3):
$$W(S, c) = \sum_{k=1}^{K} \sum_{x_i \in S_k} \sum_{f=1}^{F} (x_{if} - c_{kf})^2, \tag{3}$$
where $x_{if}$ is the value of feature $f$ at object $x_i$, and $c_{kf}$ is the value of feature $f$ at centroid $c_k$. Starting with a random initial partition and a set of centroids $c$, it finds a partition $S$ that minimizes the sum of squares $W(S, c)$ for the given $c$, and then finds the vector $c'$ that minimizes $W(S, c')$ for this partition $S$. The procedure is repeated, with $c'$ taking the place of $c$, till convergence, that is, till $c'$ coincides with $c$. In practice, the method converges fast to a local minimum which depends strongly on the choice of the starting partition.
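The alternating minimization scheme can be sketched in a few lines of NumPy. This is an illustrative implementation and not the code used in our experiments; for simplicity it starts from K randomly chosen objects as centroids rather than from a random partition, and the two-blob data set is synthetic.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain alternating-minimization K-means; returns labels, centroids, W(S, c)."""
    rng = np.random.default_rng(seed)
    # Assumption of this sketch: initial centroids are k distinct random objects.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: for fixed centroids, assign each object to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: for the fixed partition, the optimal centroids are cluster means
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # c' coincides with c: converged
        centroids = new_centroids
    w = ((X - centroids[labels]) ** 2).sum()  # sum of squares criterion
    return labels, centroids, w

# Synthetic example: two well-separated groups of 20 objects each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centroids, w = kmeans(X, k=2)
```

Since the result depends on the starting point, it is common to rerun the procedure from several random initializations and keep the partition with the smallest value of the criterion.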
The Ward clustering algorithm [51,53] follows the so-called agglomerative hierarchical approach. At each step, this algorithm considers a current partition $S = \{S_1, S_2, \ldots, S_K\}$ with $K$ clusters and their centers $c = \{c_1, c_2, \ldots, c_K\}$, and merges two clusters, $S_k$ and $S_l$, into a new cluster $S_{kl} = S_k \cup S_l$, with its center $c(k, l) = (N_k c_k + N_l c_l)/(N_k + N_l)$, where $N_k$ and $N_l$ are the cardinalities of clusters $S_k$ and $S_l$, respectively. The clusters to be merged are selected so that the increase $\Delta(k, l)$ in the value of Criterion (3) (Eq 4) reaches its minimum over all $k$ and $l$ (such that $k \neq l$):
$$\Delta(k, l) = W(S(k, l), c(k, l)) - W(S, c), \tag{4}$$
where $S(k, l)$ denotes the new partition with $K - 1$ clusters obtained from $S$ by merging $S_k$ and $S_l$ (i.e. $S_{kl} = S_k \cup S_l$), and $c(k, l)$ denotes the set of centroids of this new partition. The quantities $\Delta(k, l)$ are all non-negative because the value of Criterion (3) decreases as the number of clusters $K$ grows, so that it becomes zero at $K = N$. It is not difficult to derive the following formula explicitly expressing $\Delta(k, l)$ through the clusters being merged (Eq 5):
$$\Delta(k, l) = \frac{N_k N_l}{N_k + N_l}\, d^2(c_k, c_l), \tag{5}$$
where $d(c_k, c_l)$ is the Euclidean distance between centroids $c_k$ and $c_l$. This formula shows that the square error criterion tends to merge those clusters whose centers are nearest and whose sizes are most unbalanced. The generic Ward clustering algorithm starts with a trivial partition consisting of $N$ singletons, each being its own center, and then merges, one by one, the clusters with the lowest Ward distance (Eq 5) between them till all objects fall into the unique cluster comprising all of them.
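The closed-form Ward distance can be verified numerically: the increase of the sum of squares criterion caused by merging two clusters equals N_k N_l / (N_k + N_l) times the squared Euclidean distance between their centers. The sketch below checks this identity on two synthetic clusters; the data and cluster sizes are arbitrary.

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared deviations of the cluster's objects from its center."""
    return ((cluster - cluster.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(1)
S_k = rng.normal(0.0, 1.0, size=(5, 3))   # cluster with N_k = 5 objects
S_l = rng.normal(2.0, 1.0, size=(8, 3))   # cluster with N_l = 8 objects
merged = np.vstack([S_k, S_l])

# Direct computation: criterion after the merge minus criterion before it
# (all other clusters are unchanged and cancel out).
delta_direct = within_ss(merged) - within_ss(S_k) - within_ss(S_l)

# Closed form using only cluster sizes and centers.
n_k, n_l = len(S_k), len(S_l)
c_k, c_l = S_k.mean(axis=0), S_l.mean(axis=0)
delta_ward = n_k * n_l / (n_k + n_l) * ((c_k - c_l) ** 2).sum()

# Center of the merged cluster as a size-weighted mean of the two centers.
c_kl = (n_k * c_k + n_l * c_l) / (n_k + n_l)
```

The closed form is what makes the agglomerative procedure practical, since only cluster sizes and centers need to be stored and updated after each merge.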

Supervised machine learning algorithms
In this section, we present the main characteristics of the supervised traditional machine learning and deep learning algorithms used and compared in our work. Their scikit-learn and PyTorch implementations were used in our computational experiments. The obtained results are presented in the Results and Discussion section. Importantly, all machine learning algorithms were applied in a personalized fashion, i.e., a separate machine learning model was constructed for each of the 831 real users considered in our experiments.
Decision Trees (DT): Decision trees are hierarchical models based on a succession of simple decision rules [54]. Each decision tree consists of a root, nodes, branches and leaves. Each node represents a test of a given attribute, while branches represent the outcomes of that test. A decision is taken upon reaching a leaf that corresponds to the predicted class. The decision rules are inferred from the training data and the features. A popular approach to building a decision tree is impurity minimization at each node based on the Gini impurity measure (Eq 6), which aims at reducing the probability of making errors during the classification. The Gini impurity measure is defined as follows:
$$G(Z) = 1 - \sum_{k=1}^{K} P_k^2, \tag{6}$$
where $Z$ is a training set containing $K$ classes, $k$ is a given class, and $P_k$ is the proportion of objects belonging to class $k$.
Random Forest (RF): Random Forest is an ensemble learning algorithm combining several decision trees [55]. Each decision tree is built on a sub-sample of the training set drawn with replacement, following a meta-algorithm known as bootstrap aggregation (bagging) that aims at minimizing the variance and helping avoid overfitting. The final decision for an observation is taken based on a majority vote between the outcomes of all decision trees. The main advantages of the Random Forest algorithm are that it is known to be resistant to potential outliers and that it is easily parallelizable.
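The personalized setting, in which one model is fitted per user, can be sketched as follows with scikit-learn's RandomForestClassifier. The synthetic data, the binary target, the number of trees and the 75/25 split are assumptions made for this illustration and do not reproduce our experimental protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users, n_rows, n_features = 3, 200, 5   # toy sizes, not the real 831 users

models, scores = {}, {}
for user_id in range(n_users):
    # Synthetic stand-in for one user's purchase history.
    X = rng.normal(size=(n_rows, n_features))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = product bought

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    # One separate model per user; Gini impurity is the split criterion.
    clf = RandomForestClassifier(
        n_estimators=100, criterion="gini", random_state=0
    )
    clf.fit(X_tr, y_tr)
    models[user_id] = clf
    scores[user_id] = f1_score(y_te, clf.predict(X_te))

# Average F-score over users, the kind of summary reported for each algorithm.
avg_f = float(np.mean(list(scores.values())))
```

Keeping the fitted models in a dictionary keyed by user identifier makes it straightforward to serve a recommendation with the model that belongs to the requesting user.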
Gradient Boosting Tree (GBT): Gradient Boosting Trees are an ensemble learning method using decision trees as weak learners and gradient descent optimization (similarly to neural networks) to achieve the best solution for either classification or regression problems [56,57]. Unlike Random Forest, which relies on bagging, GBT is based, as indicated by its name, on boosting. The algorithm is iterative. It tries to minimize the loss function by sequentially fitting a new tree at each step and correcting the prediction error from the previous steps. There exist different implementations of GBT, and some of them often perform better than others in practice. In this study, we used the scikit-learn, XGBoost and CatBoost implementations of the GBT algorithm [58][59][60]. Previous works in both the classical and deep learning literature have shown that ensemble methods (boosting and bagging) combining multiple weak learners can drastically improve performance over the baseline algorithm. Moreover, boosting tends to outperform bagging on datasets which contain uneven data coverage, hence our choice of the XGBoost and CatBoost algorithms.
Naive Bayes (NB): Naive Bayes is the simplest form of a Bayesian network. This probabilistic approach is based on Bayes' theorem (Eq 7), defined as follows:
$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}, \tag{7}$$
where $y$ is the class and $X$ is the set of features. One of the main drawbacks of Naive Bayes is that it makes the strong assumption that all considered features are independent, which rarely occurs in real-life scenarios. Nonetheless, Naive Bayes has been known to provide competitive results in some cases, especially in spam detection and in sentiment analysis [61,62].
Support Vector Machines (SVM): A Support Vector Machine algorithm attempts to separate a given dataset using a hyperplane. While an infinity of different hyperplanes may exist for that task, SVM chooses the one maximizing the margin between representative observations belonging to each class. These observations are called support vectors [63]. SVM introduces the concept of soft margin to deal with outliers or non-linear data, and it permits the algorithm to choose a hyperplane while allowing a few mistakes to obtain a better final separation [64]. However, the data are often not linearly separable even when soft margins are used. In this case, it is possible to transform the data, considering a higher dimensional space which allows for a better class separation. This is achievable through the use of kernel functions such as the radial basis function (rbf) (Eq 8) defined as follows:
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right), \tag{8}$$
where $x$ and $x'$ are two observations and $\gamma$ is the kernel function coefficient set by the user beforehand [65,66]. The choice of the most appropriate kernel function is usually guided by trial and error.
Logistic Regression: Logistic Regression is a simple classification model using a logistic function (Eq 9) to model the probability of all outcomes of a single trial [67]. It is usually of the following form:
$$f(x) = \frac{1}{1 + e^{-(x - \mu)/s}}, \tag{9}$$
where $\mu$ is a location parameter and $s$ is a scale parameter proportional to the standard deviation.
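For illustration, the radial basis function kernel, K(x, x') = exp(-gamma * ||x - x'||^2), and the logistic function with location mu and scale s, f(x) = 1 / (1 + exp(-(x - mu) / s)), can be evaluated directly as sketched below. The input vectors and parameter values are arbitrary, and the helper names rbf and logistic are ours.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, gamma):
    """RBF kernel value for two observations."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def logistic(x, mu=0.0, s=1.0):
    """Logistic function with location mu and scale s."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])

# Squared distance is 1 + 4 = 5, so with gamma = 0.5 the kernel is exp(-2.5).
k_ab = rbf(a, b, gamma=0.5)

# Cross-check against scikit-learn's pairwise implementation.
k_sk = rbf_kernel(a[None, :], b[None, :], gamma=0.5)[0, 0]

# The logistic function equals 0.5 exactly at its location parameter.
p_mid = logistic(3.0, mu=3.0, s=2.0)
```

A larger gamma makes the kernel decay faster with distance, which yields more local and more flexible decision boundaries.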

Multilayer Perceptron (MLP):
The perceptron is a binary classifier, and the simplest type of neural network [68]. A perceptron's classification is obtained by calculating the scalar product of the input data $(x_1, x_2, \ldots, x_n)$ and the weights $(w_1, w_2, \ldots, w_n)$, and by adding a bias $b$ to the result. The perceptron then acts as a threshold function providing the final prediction (Eq 10):
$$y' = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i + b > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$
The training phase of the perceptron consists of finding the optimal values of the weights through an iterative process of comparing the expected output $y$ to the predicted output $y'$ until the algorithm converges for the whole dataset.
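The iterative training process can be sketched with the classical perceptron update rule, in which the weights are corrected by lr * (y - y') * x whenever the prediction y' differs from the expected output y. The learning rate, the epoch limit and the logical AND data set are assumptions chosen for this small example.

```python
import numpy as np

def predict(X, w, b):
    """Threshold function: 1 if w.x + b > 0, else 0."""
    return (X @ w + b > 0).astype(int)

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_pred = int(x_i @ w + b > 0)
            update = lr * (y_i - y_pred)   # zero when the prediction is right
            if update != 0:
                w += update * x_i
                b += update
                errors += 1
        if errors == 0:                    # converged on the whole dataset
            break
    return w, b

# Logical AND: linearly separable, so the procedure is guaranteed to converge.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
```

On data that are not linearly separable (e.g., logical XOR), this loop would never reach an error-free epoch and would stop only at max_epochs, which is the limitation that multilayer networks address.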
A Multilayer Perceptron (MLP) is an artificial neural network with an input layer, a hidden layer, and an output layer consisting of interconnected neurons (or perceptrons). Whereas a simple perceptron is only capable of performing binary classification and linear separation of the data, a Multilayer Perceptron is able to capture complex relationships and perform multiclass classification. The data are fed through the input layer, then processed through the hidden layer, while the output layer gives the final decision. The MLP training phase is similar to that of a simple perceptron: it is also an iterative process aimed at finding the optimal vector of weights $w$ by comparing the predicted class $y'$ with the real class $y$. However, considering its more sophisticated nature and the presence of a hidden layer, the MLP relies on backpropagation to handle the errors during the training phase [69].
Proposed RNN-GRU model: A recurrent neural network (RNN) is a deep learning network designed to embed sequential time-dependent data. In this study, we used a gated recurrent unit (GRU) RNN architecture to represent users' baskets. Precisely, we generalized the DREAM (Dynamic REcurrent bAsket Model) model proposed by Yu et al. [30] to predict the next basket content. Moreover, we used some additional features such as product prices, product availability, and weekly specials offered in local stores, which were not considered by Yu et al. We applied some important modifications to the original DREAM model to adapt it to a multi-class classification (only a binary classification problem is considered in [30]). Specifically, we embedded each available product using an Embedding layer in PyTorch, which was concatenated with the rest of the features passed through a two-layer perceptron; thus, each product was represented by an augmented vector (see Fig 6 for a schematic view of the proposed model's architecture).
Precisely, our RNN architecture contained 2 GRU layers of 64 neurons each. The parameter optimization was carried out by the RMSProp optimizer in PyTorch. We selected the optimal learning rate using 5-fold cross-validation. To prevent overfitting, we trained the model with a drop-out rate of 0.1.
Since, in this work, we study the benefits of training a single model per user, no explicit user embedding needs to be constructed. Each basket b_i was treated as an arbitrary permutation of products b_i = {p_{i,1}, ..., p_{i,n}} encoded within the augmented feature space, which was then summarized by the GRU cells into h_{i,t}. The hidden embedding h_{i,t} acts as an implicit user representation, since it stores information regarding the user's shopping baskets.
To obtain the product's affinity score, i.e. the score proportional to the probability of the product p_{i,t} to be included in a basket containing the products p_{i,1}, ..., p_{i,t−1}, p_{i,t+1}, ..., p_{i,n}, we multiplied the product embedding matrix M by the hidden embedding h_{i,t} (Eq 11):
$$o_{i,t} = M\,h_{i,t} \tag{11}$$
A higher score o_{i,t} indicates that the user is more likely to purchase the corresponding item. The Bayesian Personalized Ranking (BPR) loss was used to approximate and maximize the following probability (Eq 12):
$$p(v \succ v' \mid b_i) = \sigma\left(o_{i,t}(v) - o_{i,t}(v')\right) \tag{12}$$
where v denotes a positive item included in the basket b_i, v′ denotes a negative item not included in the basket b_i, o_{i,t}(v) is the component of o_{i,t} corresponding to item v, and σ(x) is the logistic function, which maps scores onto the probability space. To implement this objective, we sampled a number of negative items that were absent in the current basket b_i (in our experiment, the number of negative items was equal to the number of positive items in b_i), and maximized the corresponding log-likelihood over all products and baskets (Eq 13):
$$\max \sum_{b_i} \sum_{(v, v')} \ln \sigma\left(o_{i,t}(v) - o_{i,t}(v')\right) \tag{13}$$
To process sequential data, for example for market basket recommendation, it is common to use recurrent prediction mechanisms (e.g. LSTM, GRU or vanilla RNN). While other alternatives exist, they either tend to underperform in recommendation tasks (e.g. causal 1D convolutions), or require a large amount of good-quality data (e.g. self-attention and transformer mechanisms). Since our data do not require the prediction of extremely long sequences, using a GRU cell is a suitable design choice which balances predictive performance against training speed.
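The BPR objective with negative sampling described above can be illustrated by the following plain-Python sketch. It assumes that the affinity scores have already been produced by the network, that each positive item is paired with exactly one sampled negative item, and that the loss is the mean of −ln σ over these pairs; the item names, the scores and the helper names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Logistic function mapping a score difference to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def sample_negatives(basket, catalogue, rng):
    """Draw as many absent items as there are items in the basket."""
    absent = [item for item in catalogue if item not in basket]
    return rng.sample(absent, len(basket))

def bpr_loss(scores, basket, negatives):
    """Mean of -ln sigma(o_v - o_v') over paired positive/negative items."""
    pairs = list(zip(basket, negatives))
    total = sum(-math.log(sigmoid(scores[v] - scores[w])) for v, w in pairs)
    return total / len(pairs)

rng = random.Random(0)
catalogue = ["milk", "bread", "eggs", "rice", "tea", "salt"]
# Hypothetical affinity scores, one per product.
scores = {"milk": 2.0, "bread": 1.5, "eggs": 0.2,
          "rice": -0.5, "tea": -1.0, "salt": -2.0}
basket = ["milk", "bread"]
negatives = sample_negatives(basket, catalogue, rng)
loss = bpr_loss(scores, basket, negatives)
```

Minimizing this loss is equivalent to maximizing the probability that purchased items are ranked above the sampled non-purchased ones.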

Parameters optimization and cross-validation
Parameter tuning is a decisive step when building a machine learning model, since most models rely heavily on their selected parameters and provide substantially better performance when properly optimized. Ignoring the optimization of parameters can lead to the selection of a sub-optimal solution at the end of the experimentation.
There exist several methods to optimize the model's parameters such as Grid Search, Random Search or Bayesian optimization [70,71].
Grid Search takes a grid of parameters and carries out exhaustive testing with all parameter combinations in order to ultimately select the one that yields the best results for the data at hand. In this study, we used Random Search as the parameter optimization technique for the models listed in subsection 3.5. Similarly to Grid Search, Random Search considers a grid of parameters and values. However, Random Search conducts trials on random combinations of parameters instead of performing an exhaustive search. This allows one to use distributions instead of specific values for continuous settings and ensures better time and resource management (the running time is not necessarily related to the number of parameters/values, as the number of parameter combinations to be tested can be fixed by the user). Random Search has been shown to outperform Grid Search in terms of both the results and the running time [70].
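The principle of Random Search can be sketched as follows. This is a minimal illustration rather than the procedure used in this study: the parameter names, their ranges, and the function toy_score (a hypothetical stand-in for a cross-validated F-score) are assumptions made for the example.

```python
import math
import random

def random_search(sample_params, evaluate, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def sample_params(rng):
    # A continuous setting is drawn from a log-uniform distribution,
    # a discrete one from a list of candidate values.
    return {"learning_rate": 10 ** rng.uniform(-4, -1),
            "n_estimators": rng.choice([50, 100, 200, 400])}

def toy_score(params):
    # Hypothetical objective that peaks at learning_rate = 0.01.
    return -abs(math.log10(params["learning_rate"]) + 2)

best_params, best_score = random_search(sample_params, toy_score, n_trials=20)
```

The number of trials is fixed by the user, so the running time does not grow with the size of the parameter grid.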
While using Random Search, we carried out k-fold cross-validation to ensure that the best selected model does not overfit the data [72,73]. K-fold cross-validation is a common model validation technique in machine learning which consists in dividing the data into k equally sized sub-samples. A single sub-sample is then retained for validation purposes while the model is trained on the rest of the data (i.e. on the remaining k − 1 sub-samples). This process is repeated k times, using each sub-sample exactly once for validation. The final evaluation of the model is the average of the k results. In our experiments, we set k = 5 (which is a commonly used number of sub-samples) [74]. Using this methodology, we were able to optimize the parameters of the 10 machine learning algorithms described in subsection 3.5, making sure that the results presented in Tables 1-3 are not due to data overfitting and correspond to real-case scenarios.
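The k-fold procedure described above can be sketched in a few lines of plain Python. The sketch assumes that the observations have already been shuffled and are addressed by their index; the function names and the fit_and_score callback are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; each sample validates once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the k validation scores returned by fit_and_score."""
    scores = [fit_and_score(train, validation)
              for train, validation in k_fold_indices(n_samples, k)]
    return sum(scores) / k

folds = list(k_fold_indices(23, 5))
```

With 23 observations and k = 5, the folds contain 5, 5, 5, 4 and 4 observations, and every observation is used for validation exactly once.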

Table 1. F-scores provided by supervised ML/DL methods for all users of the MyGroceryTour website as well as for each of the four identified clusters of users.
The best overall results are highlighted in bold.

MyGroceryTour recommender system
In order to determine the products to be recommended to a given user by the MyGroceryTour recommender system, we classify all the products based on the user's purchase history, current specials information and product availability in each of the local grocery stores considered. In order to classify the products efficiently, the use of both positive and negative feedback is necessary. One personalized machine learning model (i.e. one model per user) is built and updated weekly in our system. While we consider the products bought by a given user as positive feedback, we regard as negative feedback all the products that were available to this user at the time of the order but not acquired by him/her. For an order of size S, if T is the total number of products available to the user at the time of the order, then the negative feedback N for that order is N = T − S.
Typically, N represents thousands of products, while S usually varies from 5 to 50. This difference in size between positive and negative feedback leads to unbalanced training data and may result in a significant loss in performance. Similarly to Xia et al. [33], we decided to use an undersampling method to balance the user's data instead of treating all available but disregarded items as negative feedback. Undersampling methods have proven to be efficient for both binary and multi-class classifications [75,76].
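A minimal sketch of random undersampling for a single order is given below. It assumes a 1:1 ratio between positive and negative examples and uniform random sampling of the negatives; both choices, as well as the function name, are assumptions made for illustration.

```python
import random

def undersample_negatives(purchased, available, seed=0):
    """Return (positives, negatives) with equally many items in each class."""
    rng = random.Random(seed)
    positives = sorted(purchased)
    # All products available at order time but not bought (N = T - S items).
    candidates = sorted(set(available) - set(purchased))
    n_negatives = min(len(positives), len(candidates))
    negatives = rng.sample(candidates, n_negatives)
    return positives, negatives

available = [f"p{i}" for i in range(1000)]  # T = 1000 available products
purchased = available[:12]                  # S = 12 purchased products
positives, negatives = undersample_negatives(purchased, available)
```

Here only 12 of the 988 candidate negatives are kept, so the resulting training set for this order is balanced.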
As the number of products recommended by the machine learning models is often greater than the average grocery list size, S_u, calculated for a given user u, only the S_u items with the highest confidence scores were retained for the final recommendation. Typically, the confidence score was calculated as the probability estimate for the predicted class for a given observation; it can for instance be obtained using the predict_proba method in scikit-learn.
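The final selection step can be sketched as follows. The sketch assumes that the confidence scores are available as a mapping from product to probability estimate and that a non-integer average list size is rounded to the nearest integer; the product names and values are hypothetical.

```python
def top_recommendations(confidence_by_product, average_list_size):
    """Keep the S_u products with the highest confidence scores."""
    size = int(round(average_list_size))
    ranked = sorted(confidence_by_product.items(),
                    key=lambda item: item[1], reverse=True)
    return [product for product, _ in ranked[:size]]

# Hypothetical probability estimates for five candidate products.
confidence = {"milk": 0.91, "rice": 0.40, "eggs": 0.77,
              "salt": 0.12, "tea": 0.65}
recommended = top_recommendations(confidence, average_list_size=2.6)
```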

Clustering analysis
As mentioned above, we first used clustering to identify the profiles of the users of MyGroceryTour. To do so, we considered the following features:
• avg_price (numerical): the average price of products bought by a specific user;
• avg_special (numerical): the average discount percentage on products bought by a specific user;
• avg_list_size (numerical): the average size of the shopping list of a specific user;
• pca_category (numerical): this feature accounts for the category of products selected by a specific user. Here, we built an 831 × 24 matrix (831 is the number of users and 24 is the number of available categories) reflecting the user's choice of different categories of products. Each value of this matrix represents the number of items of a given category acquired by a specific user. We carried out a PCA analysis to reduce the matrix dimension and to determine the percentage of variance accounted for by the main principal axes. The first (principal) PCA axis accounted for 72.6% of the total variance, the second axis for 12.1%, whereas the variance explained by each of the remaining axes was negligible. We decided to keep for our clustering analysis a single transformed feature representing the product's category. The new transformed feature corresponds to the normalized values of the first principal axis. This allows us to give the same weight to all features considered by clustering algorithms;
• avg_fidelity_ratio (numerical): the average of the quantity-based fidelity ratio (QFR) and the price-based fidelity ratio (PFR) defined in Eqs (14) and (15), respectively. Here, avg_fr_u = (QFR_u + PFR_u)/2, where u is a given user and avg_fr_u is the average fidelity ratio of this user.
The quantity-based fidelity ratio (QFR) and the price-based fidelity ratio (PFR) are both meant to give insight into the customer's fidelity to his/her favorite store.
A QFR value close to 1 indicates that a given consumer tends to do his/her grocery shopping in the same (favorite) store, whereas a QFR value close to 0 indicates that the customer tends to do his/her grocery shopping in many different stores. The quantity-based fidelity ratio is defined by Eq 14, where u represents a given user, n (a positive integer) is the total number of stores where the user u bought at least one product, X_{max,u} is the total number of products acquired by the user u in his/her favorite store (i.e. where he/she made most of his/her purchases), and X_{total,u} ($X_{total,u} = X_{max,u} + \sum_{i=2}^{n} X_{i,u}$) is the total number of products purchased by the user u over all the stores where he/she bought at least one product.
Similarly, the price-based fidelity ratio (PFR) depends on the total price of the products acquired by the customer in his/her favorite store. The price-based fidelity ratio is defined by Eq 15, where u represents a given user, n (a positive integer) is the total number of stores where the user u bought at least one product, P_{max,u} is the total price of all products acquired by the user u in his/her favorite store, and P_{total,u} ($P_{total,u} = P_{max,u} + \sum_{i=2}^{n} P_{i,u}$) is the total price of all products purchased by the user u over all the stores where he/she bought at least one product.
The input data for clustering analysis consisted of a matrix of 831 observations (i.e. corresponding to the 831 selected users of MyGroceryTour) and 5 features. Prior to performing clustering, we normalized the data at hand. We tested both Z-score and MinMax normalizations. The results presented below have been obtained using MinMax normalization as it provided slightly better clustering results than Z-score. The clustering analysis was carried out using both the Ward algorithm [51], which is one of the most popular hierarchical clustering algorithms, and K-means [50], which is certainly the most popular partitioning algorithm, through their scikit-learn implementations. The default scikit-learn parameters of the Ward and K-means algorithms were used.
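The MinMax normalization applied prior to clustering can be sketched as follows: each feature (column) is rescaled linearly to the [0, 1] range. The function name and the small example table are illustrative; constant columns are mapped to 0 by assumption.

```python
def min_max_normalize(rows):
    """Rescale each column of a row-major table to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [[(value - low) / span if span else 0.0
             for value, low, span in zip(row, lows, spans)]
            for row in rows]

# Three users described by two features on very different scales.
table = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
normalized = min_max_normalize(table)
```

After rescaling, all features contribute comparably to the Euclidean distances used by the Ward and K-means algorithms.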
We used the popular Silhouette [77] and Davies-Bouldin (DB) [78] cluster validity indices to determine the number of clusters in our dataset.
The Silhouette width is defined as follows. Given a partition P of a data set X with N objects, the Silhouette width s(x_i), for object x_i ∈ X, represents the degree of correspondence between x_i and the partition. The average distance from object x_i to the other objects of its cluster C_k can be defined as follows (Eq 16):
$$a(x_i) = \frac{1}{|C_k| - 1} \sum_{x_j \in C_k,\, j \neq i} \lVert x_i - x_j \rVert \tag{16}$$
and the smallest average distance from x_i to the objects of another cluster as follows (Eq 17):
$$b(x_i) = \min_{l \neq k} \frac{1}{|C_l|} \sum_{x_j \in C_l} \lVert x_i - x_j \rVert \tag{17}$$
where ‖·‖ denotes the Euclidean distance. The Silhouette width for an object s(x_i) is defined as the relative difference between a(x_i) and b(x_i) (Eq 18):
$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}} \tag{18}$$
The global Silhouette width value is then defined as follows (Eq 19):
$$s(P) = \frac{1}{N} \sum_{i=1}^{N} s(x_i) \tag{19}$$
It represents the extent of consistency of partition P. The maximum value of s(P) corresponds to the "right" number of clusters.
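For illustration, the global Silhouette width can be computed as in the sketch below. It assumes the standard Silhouette definition with Euclidean distance (mean distance to the other members of the object's own cluster versus the smallest mean distance to the members of another cluster) and assigns a width of 0 to singleton clusters by convention; it is not the scikit-learn implementation, and the four-point data set is a toy example.

```python
import math

def silhouette_width(points, labels):
    """Mean Silhouette width of a partition (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    def mean_distance(point, members):
        return sum(math.dist(point, m) for m in members) / len(members)

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # The zero distance of the point to itself is excluded from the mean.
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        b = min(mean_distance(point, members)
                for other, members in clusters.items() if other != label)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
compact = silhouette_width(points, [0, 0, 1, 1])  # well-separated clusters
mixed = silhouette_width(points, [0, 1, 0, 1])    # poorly chosen partition
```

The well-separated partition yields a value close to 1, whereas the mixed partition yields a negative value.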
The Davies-Bouldin index is the average similarity between each cluster C_i, for i = 1, ..., k, and its most similar counterpart C_j. It is calculated as follows (Eq 20):
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} S_{ij} \tag{20}$$
where S_{ij} is the similarity value between clusters C_i and C_j, calculated as (d_i + d_j)/δ_{ij}, where d_i and d_j are the mean distances between the objects of clusters C_i and C_j, respectively, and their cluster centroids, and δ_{ij} is the distance between the centroids of clusters C_i and C_j. The minimum value of the DB index corresponds to the "right" number of clusters.
While the highest value of the Silhouette index and the lowest value of the Davies-Bouldin index were found for the solution with K = 2 clusters, we present here the most interesting solution, found for K = 4 clusters (see Fig 7). This solution corresponds to the highest local maximum of the Silhouette index and the lowest local minimum of the Davies-Bouldin index (see Fig 8). We used the t-distributed Stochastic Neighbor Embedding (t-SNE) [79] as a dimensionality reduction method to visualize the clustering solution provided by the Ward algorithm (see Fig 7). During our experiments, we used a perplexity of 30 and a learning rate of 925 as parameters for the t-SNE method, whereas the t-SNE initialization was based on principal component analysis [80] in order to preserve the general shape of the data.
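The Davies-Bouldin index described above can be sketched in plain Python as follows. The sketch assumes Euclidean distances and at least two clusters with distinct centroids; it is an illustration of the definition rather than the scikit-learn implementation, and the toy data are hypothetical.

```python
import math

def davies_bouldin(points, labels):
    """Average, over clusters, of the worst-case similarity (d_i + d_j) / delta_ij."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centroids = {label: tuple(sum(coords) / len(members)
                              for coords in zip(*members))
                 for label, members in clusters.items()}
    # Mean distance between the objects of a cluster and its centroid.
    scatter = {label: sum(math.dist(p, centroids[label]) for p in members)
               / len(members)
               for label, members in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j])
                     / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
db_compact = davies_bouldin(points, [0, 0, 1, 1])
db_mixed = davies_bouldin(points, [0, 1, 0, 1])
```

Lower values indicate more compact and better separated clusters, which is why the minimum of the index is sought.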
It is worth mentioning that the clustering solution provided by K-means (for K = 4 clusters; this solution is not presented here) was similar, but had a slightly greater cluster overlap, compared to that found by Ward. The four user profiles shown in Fig 7 are as follows:
• Cluster 1 (in red in Fig 7) includes customers who are moderately sensitive to specials and usually buy their groceries in the same store (i.e. have high fidelity ratios);
• Cluster 2 (in green in Fig 7) is the most diverse cluster, consisting of customers who buy their groceries in different stores (i.e. have low fidelity ratios). The members of this cluster are usually sensitive to specials;
• Cluster 3 (in blue in Fig 7) comprises customers who usually purchase the same (or similar) products in the same store (i.e. have high fidelity ratios), and who hardly react to specials;
• Cluster 4 (in yellow in Fig 7) includes customers who are very sensitive to specials and buy their groceries in the same store (i.e. have high fidelity ratios).

Application and comparison of supervised machine learning algorithms
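To assess the performance of the 10 traditional machine learning and deep learning algorithms considered in our study, we used F-score, which is a popular and reliable metric used to evaluate classification methods [81-83]. F-score is the harmonic mean of the precision and recall. It is defined as follows (Eq 21):
$$F = 2 \cdot \frac{precision \cdot recall}{precision + recall} \tag{21}$$
where the recall is defined as TP/(TP + FN) and the precision as TP/(TP + FP), with TP, FP and FN denoting the numbers of true positives, false positives and false negatives, respectively.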
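As a worked example of the F-score computation, the sketch below compares a recommended list with the basket actually purchased. Treating both lists as sets of product identifiers, and returning 0 when no item is correctly recommended, are assumptions made for this illustration; the product names are hypothetical.

```python
def f_score(recommended, purchased):
    """Harmonic mean of precision and recall for one recommendation."""
    recommended, purchased = set(recommended), set(purchased)
    tp = len(recommended & purchased)   # recommended and bought
    fp = len(recommended - purchased)   # recommended but not bought
    fn = len(purchased - recommended)   # bought but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# TP = 2, FP = 2, FN = 1: precision = 1/2, recall = 2/3, F = 4/7.
score = f_score(["milk", "bread", "eggs", "tea"], ["milk", "bread", "rice"])
```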

Results
The F-score results for the traditional machine learning and deep learning algorithms considered in our study are presented in Table 1. In this table, the overall average F-score results (obtained over all 831 users of MyGroceryTour) are presented along with cluster performances.
We can observe that three algorithms stand out by outperforming the rest of the methods, each providing the best F-score performance for at least one cluster of users. The best overall result, an F-score of 0.559, was obtained by our RNN-GRU model. This model also provided the best average results for the users from Cluster 2 (with F-score of 0.506) and those of Cluster 4 (with F-score of 0.597), whose behaviour is the most difficult to predict. Random Forest returned the best results for the users of Cluster 1 (with F-score of 0.583), whereas the radial basis SVM provided the best results for the users of Cluster 3 (with F-score of 0.662; the behaviour of the users from this cluster was the easiest to predict). We can also notice that baseline algorithms such as Naive Bayes and Decision Tree consistently underperformed across all clusters.
The promising performance of the generalized RNN-GRU DREAM model, which learns a dynamic representation of a given user and captures global sequential characteristics existing among the user's baskets, suggests that it is the model best suited to the personalized basket recommendation task. It is capable of modeling the behaviour of the most diverse group of users, i.e. those forming Cluster 2, who buy their groceries in different stores and are sensitive to specials.
Tables 2 and 3 present, respectively, the Recall and Accuracy results provided by the ML and DL algorithms considered in our study. These results are generally concordant with the F-score results reported in Table 1, as in both cases the RNN-GRU algorithm outperforms the other methods for the whole set of 831 users.
Figs 9 and 10 illustrate the impact of the number of baskets and the average basket size on the prediction performance of Random Forest (the best traditional machine learning algorithm) and RNN-GRU (the best deep learning algorithm), respectively. We can observe that both Random Forest and RNN-GRU work best for users with high numbers of baskets (75 and greater), although the impact of the number of baskets is greater for Random Forest (see Fig 9a).
On the other hand, a larger average basket size does not always result in a better prediction performance. For example, Random Forest (see Fig 9b) is less effective for users with an average basket size over 20 items than for those with an average basket size varying from 16 to 20 items. This could be due to complex relationships between items within the baskets. The performance of RNN-GRU seems to be less affected by the basket size, although this algorithm works better for users having more than 5 items in their baskets on average.

Conclusion
In this paper, we presented a novel personalized Recommender System included in the MyGroceryTour web platform, which is designed to suggest the best weekly grocery deals to Canadian customers. Our system applies the most appropriate ML or DL prediction model (see Fig 11) to provide a given customer with a weekly grocery list that suits him/her best as well as the list of stores in which the customer should purchase each product being recommended. Our system takes into account several features related to the customer's purchase history as well as features related to the current price and availability of products in local grocery stores. One of the advantages of our Recommender System is that it can recommend to each customer the products he/she has never bought before, which can be helpful to discover new relevant products or be aware of limited-time deals.
Our results demonstrate that different ML and DL methods should be applied for different clusters of users (see the results in Tables 1-3). To identify these groups of users according to their shopping behavior, we applied two representative clustering algorithms, K-means (a partitioning algorithm) and Ward (a hierarchical clustering algorithm), which are known for their simplicity and speed. Our clustering analysis, conducted using Ward's algorithm, divided the entire group of 831 Canadian customers considered in this work into four clusters according to their shopping habits. In this study, we also introduced the average fidelity ratio feature used in our clustering analysis. This feature was defined as the average of the quantity-based fidelity ratio (QFR) and the price-based fidelity ratio (PFR) introduced via Eqs (14) and (15), respectively. We then used different traditional machine learning models and a new deep learning model to provide next basket recommendations. Interestingly, the average F-score values obtained for users from different clusters were quite different (see Table 1). They varied from 0.328 (for users of Cluster 2, who buy their groceries in different stores and are sensitive to specials) to 0.543 (for users of Cluster 3, who usually purchase the same, or similar, products in the same store and are not very sensitive to specials). We can also observe that some of the ML methods were much better than others in recommending items for a specific cluster of users. Thus, it would be reasonable to apply different prediction methods for different groups of customers: Random Forest for customers from Cluster 1, RNN-GRU for customers from Clusters 2 and 4, and SVM-RBF for customers from Cluster 3. Overall, the best results were provided by our RNN-GRU implementation. In terms of the average F-score, it outperformed Random Forest, the second best performing model, by 0.043. RNN-GRU also yielded the most consistent results across all clusters.
The flowchart presented in Fig 11 provides a general overview of our Recommender System.
The superiority of the proposed RNN-GRU model indicates that in a grocery shopping context, temporal behaviour of the user, which reveals the user's dynamic interests at different times, and sequential characteristics of shopping baskets, which reflect interactions between all user's baskets over time, are two crucial prediction factors for next basket recommendation. Our promising prediction results can be explained by the nature of the data: indeed, grocery data are often very repetitive as users tend to buy a core of similar items (such as first necessity products) regularly, thus developing constant habits.
It is important to note that in terms of F-score our personalized RNN-GRU model outperformed the recent general LSTM-based model proposed by Tahiri et al. [6] by 0.339 when we used the new data available on the MyGroceryTour platform. Furthermore, for the augmented data considered by Tahiri et al., our F-score result was 0.189 higher than that of Tahiri and coauthors. The model introduced in our study is personalized (i.e. the model's parameters are tuned for each user). Specifically, our current model is equivalent to training a single aggregate model (as that of Tahiri et al.) for all users, and conditioning the inputs on the user embedding. Thus, in our current model, the implicit user embedding is the ground-truth one-hot vector. This explains its superior performance compared to the aggregate model of Tahiri et al. LSTM models tend to be heavier than GRUs in terms of both training and inference, which is a limiting factor in our use case. However, it is possible to swap one for the other in most practical settings, when the sequence length is not too large.
Table 4 reports the prediction performances of the most important recent ML and DL models used in the field of next basket recommendation. We can see that it is difficult to compare our results directly to those provided by most of the existing studies (as well as to compare the results of the existing studies with one another) because most of these studies have been conducted using different datasets and different evaluation metrics. The only direct comparison can be made with the work of Tahiri et al. (2019), as these authors also analyzed MyGroceryTour data (see above). One of the main contributions of our study, in addition to the use of clustering, is that we work in the multi-class (i.e. multi-store) classification context, while all previous studies considered the case of binary (i.e. one-store) classification, i.e. when a product can be recommended or not without suggesting the store where it should be bought (if recommended).
The Python implementation of all clustering and machine learning algorithms used in our work as well as the described anonymized 831-user data set are available in our GitHub repository at: https://github.com/Achrafb11/Smartshopping.
One of the limitations of our approach lies within the platform itself. Indeed, MyGroceryTour does not allow people to buy the products directly. Thus, we have no assurance that the users actually bought the items included in their grocery lists. We also cannot track stocks in different stores to potentially notify the users of shortages prior to adding products to their grocery lists. Our Recommender System is also sensitive to the cold start problem and it is not yet able to predict the exact quantity of each item recommended for inclusion in the user's next basket. We plan on addressing these limitations in our future work, in which we will also explore the impact of seasonality on grocery shopping habits, which could lead to improved recommendations as well.